Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Extend KNN neighbor search beyond coincident sites #287

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

sjsrey
Copy link
Member

@sjsrey sjsrey commented May 3, 2020

This is aimed at pysal/pysal#1060.

If the approach makes sense, it can be extended to Kernel weights and DistanceBand weights. The latter will need slightly different handling for symmetry.

@sjsrey sjsrey requested a review from ljwolf May 3, 2020 18:47
@codecov
Copy link

codecov bot commented May 3, 2020

Codecov Report

Merging #287 into master will increase coverage by 0.19%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #287      +/-   ##
==========================================
+ Coverage   80.99%   81.19%   +0.19%     
==========================================
  Files         115      115              
  Lines       11580    11713     +133     
==========================================
+ Hits         9379     9510     +131     
- Misses       2201     2203       +2     
Impacted Files Coverage Δ
libpysal/weights/distance.py 86.71% <100.00%> (+1.67%) ⬆️
libpysal/weights/tests/test_distance.py 97.18% <100.00%> (+0.09%) ⬆️
libpysal/cg/alpha_shapes.py 85.00% <0.00%> (-1.67%) ⬇️
libpysal/weights/tests/test_weights.py 99.65% <0.00%> (+<0.01%) ⬆️
libpysal/weights/tests/test_util.py 98.15% <0.00%> (+0.01%) ⬆️
libpysal/weights/weights.py 78.92% <0.00%> (+0.90%) ⬆️
libpysal/cg/tests/test_ashapes.py 96.49% <0.00%> (+1.03%) ⬆️
libpysal/weights/util.py 77.13% <0.00%> (+1.54%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 14c7509...523b07f. Read the comment docs.

@sjsrey sjsrey requested review from darribas, jGaboardi and knaaptime May 3, 2020 18:59
@ljwolf
Copy link
Member

ljwolf commented May 4, 2020

I would super prefer #285... is there something wrong with #285?

EDIT: Yes, I (a) forgot to deal with ids and (b) tested on a scenario where th number of coincident points was usually quite small relative to k, so you always got the self-index in the output. This is now resolved.

@ljwolf
Copy link
Member

ljwolf commented May 4, 2020

OK, to be clear, this does more than simply ensure that co-incident points are assigned valid (co-incident) neighbors, it forces the neighbors of co-incident points to be outside of the coincident set, correct?

duplicates = duplicated(self.data)
coincident = duplicates[:,1].any()

self.duplicates = duplicates
Copy link
Member

@ljwolf ljwolf May 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we'd want all distance-style weights objects gain a duplicates attribute?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thinking was to let the user know, somehow, that they have coincident points - duplicates is a bad choice in this regard as that might be taken to mean the records are identical in all attributes, whereas I think by coincident implies spatial duplicates.

@ljwolf ljwolf changed the title [WIP] Handle coincident points in KNN weights [WIP] Extend KNN neighbor search beyond coincident sites May 4, 2020
@ljwolf
Copy link
Member

ljwolf commented May 4, 2020

This method either ignores or fails on ids.

import libpysal, numpy, uuid

coordinates = numpy.random.random(size=(5,2))
coordinates = numpy.row_stack([coordinates]*5)

ids = [uuid.uuid4().hex for _ in range(100)] 

#fails, since we pass ids as id_order to resolve sorting issues from the dataframe
w = libpysal.weights.KNN(coordinates, id_order=ids) 
# ignores the ids because of libpysal#284
w = libpysal.weights.KNN(coordinates, ids=ids) 

tbf I think that's not really this issue's fault, but we'd need to address it.

for row in duplicate_ids[0]:
neighbors[row] = neighbors[duplicates[row, 2]]
n = self.data.shape[0]
ids = list(range(n))
Copy link
Member

@ljwolf ljwolf May 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This forces ids to be 0,n-1. The other part of this if statement, where we keep coincident points, does not do this.... regardless, it means that this fails when we use KNN.from_dataframe on something with an index...

@ljwolf
Copy link
Member

ljwolf commented May 4, 2020

I wanted to add a remove_coincident=True argument that allows users to keep the coincident points but safely remove the self-neighbor. But, I can't really figure out how to get the ids/id_order stuff fixed without adding further changes.

@pedrovma
Copy link
Member

pedrovma commented May 4, 2020

Just so I can understand, what happens now if you have different apartment units in a building (same coordinates)? Would they be considered neighbors of each other or will they be assigned neighbors in other buildings?

@sjsrey
Copy link
Member Author

sjsrey commented May 4, 2020

I would super prefer #285... is there something wrong with #285?

EDIT: Yes, I (a) forgot to deal with ids and (b) tested on a scenario where th number of coincident points was usually quite small relative to k, so you always got the self-index in the output.

My bad was I didn't see #285 until after I did this.

@sjsrey
Copy link
Member Author

sjsrey commented May 4, 2020

OK, to be clear, this does more than simply ensure that co-incident points are assigned valid (co-incident) neighbors, it forces the neighbors of co-incident points to be outside of the coincident set, correct?

Yes, that was the intention.

@sjsrey
Copy link
Member Author

sjsrey commented May 4, 2020

Just so I can understand, what happens now if you have different apartment units in a building (same coordinates)? Would they be considered neighbors of each other or will they be assigned neighbors in other buildings?

This would have them assigned neighbors with non-zero distances.

There are two (at least) cases where this happens - the one you point out where the issue is the data lacks sufficient spatial information to spatially disambiguate units in the same apartment building, and the other is with repeat sales data. In the latter, the data may have a temporal attribute that can be used to differentiate the records.

@ljwolf
Copy link
Member

ljwolf commented May 4, 2020

the data may have a temporal attribute that can be used to differentiate the records

Ah, brilliant! never even thought that you could just send (n,3) arrays into the constructor!

That'd be very strongly sensitive to scaling, right? For data measured in UTM at a monthly timeframe, your inter-temporal neighbors are much "closer" than your spatial neighbors, right? UTM coordinates are typically in thousands of meters, whereas months will be 0...11.

@ljwolf
Copy link
Member

ljwolf commented May 4, 2020

I think I'd want to see

I also don't see how this needs to be ported to kernel/distance band weights, save for estimating the bandwidth of a kernel?

@sjsrey
Copy link
Member Author

sjsrey commented May 4, 2020

the data may have a temporal attribute that can be used to differentiate the records

Ah, brilliant! never even thought that you could just send (n,3) arrays into the constructor!

That'd be very strongly sensitive to scaling, right? For data measured in UTM at a monthly timeframe, your inter-temporal neighbors are much "closer" than your spatial neighbors, right? UTM coordinates are typically in thousands of meters, whereas months will be 0...11.

Just to be clear, I wasn't thinking that they would be passing in (n,3) as the data for the knn search, with one of the three being the temporal coordinate. That would raise all kinds of scaling issues that you point out.

The motivation for the approach in the PR was that coincident spatial points violate the law (I think it was Keith Clarke's) that no two events can occupy the same point at the same time. In data sets where at temporal attribute is missing, the coincident points imply this happens.

Maybe the suggested approach should be an option, rather than the default. Another option would be to jigger the coordinates of the coincident points prior to the search - but that also comes with some possible side-effects.

@ljwolf
Copy link
Member

ljwolf commented May 4, 2020

Maybe the suggested approach should be an option, rather than the default.

Yes, I think that's appropriate. I was trying to add a remove_coincident=False above and was running into the ids/id_order issues.

Another option would be to jigger the coordinates of the coincident points prior to the search

Sure, that'd be reasonable. I think the solution implemented in #285 though is just fine, and better than reducing the precision of the data through jittering.

@ljwolf
Copy link
Member

ljwolf commented Dec 10, 2021

I'd like to merge #285 to fix the problem, and then return to this as a "ignore_coincident" option for the constructor...

@martinfleis martinfleis changed the base branch from master to main February 27, 2023 08:47
@knaaptime
Copy link
Member

i think this is resovled by graph and can be closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants